{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Introduction to Scikit-clean\n",
    "`scikit-clean` is a python ML library for classification in the presence of label noise. Aimed primarily at researchers, this provides implementations of several state-of-the-art algorithms, along with tools to simulate artificial noise, create complex pipelines and evaluate them."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Example Usage\n",
    "Before we dive into the details, let's take a quick look to see how it works. scikit-clean, as the name implies, is built on top of scikit-learn and is fully compatible* with scikit-learn API. scikit-clean classifiers can be used as a drop in replacement for scikit-learn classifiers. \n",
    "\n",
    "In the simple example below, we corrupt a dataset using artifical label noise, and then train a model using robust logistic regression: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.8888888888888888\n"
     ]
    }
   ],
   "source": [
    "from sklearn.datasets import make_classification, load_breast_cancer\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.model_selection import train_test_split, cross_val_score\n",
    "\n",
    "from skclean.simulate_noise import flip_labels_uniform, UniformNoise, CCNoise\n",
    "from skclean.models import RobustLR\n",
    "from skclean.pipeline import Pipeline, make_pipeline\n",
    "\n",
    "SEED = 42\n",
    "\n",
    "X, y = load_breast_cancer(return_X_y=True)\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.30, random_state=SEED)\n",
    "\n",
    "y_train_noisy = flip_labels_uniform(y_train, .3, random_state=SEED)  # Flip labels of 30% samples\n",
    "\n",
    "clf = RobustLR(random_state=SEED).fit(X_train,y_train_noisy)\n",
    "print(clf.score(X_test, y_test))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can use scikit-learn's built in tools with scikit-clean. For example, let's tune one hyper-parameter of `RobustLR` used above, and evaluate the resulting model using cross-validation: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.8804533457537648"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.model_selection import GridSearchCV, cross_val_score\n",
    "\n",
    "grid_clf = GridSearchCV(RobustLR(),{'PN':[.1,.2,.4]},cv=3)\n",
    "cross_val_score(grid_clf, X, y, cv=5, n_jobs=5).mean()  # Note: here we're training & testing here on clean data for simplicity"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Algorithms"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Algorithms implemented in scikit-clean can be broadly categorized into two types. First we have ones that are *inherently* robust to label noise. They often modify or replace the loss functions of existing well known algorithms like SVM, Logistic Regression etc. and do not explcitly try to detect mislabeled samples in data. `RobustLR` used above is a robust variant of regular Logistic Regression. These methods can currently be found in `skclean.models` module, though this part of API is likely to change in future as no. of implementations grow.\n",
    "\n",
    "On the other hand we have *Dataset-focused* algorithms: their focus is more on identifying or cleaning the dataset, they usually rely on other existing classifiers to do the actual learning. Majority of current scikit-clean implementations fall under this category, so we describe them in a bit more detail in next section."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Detectors and Handlers"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Many robust algorithms designed to handle label noise can be essentially broken down to two sequential steps: detect samples which has (probably) been mislabeled, and use that information to build robust meta classifiers on top of existing classifiers. This allows us to easily create new robust classifiers by mixing the noise detector of one paper with the noise-handler of another.\n",
    "\n",
    "In scikit-clean, the classes that implement those two tasks are called `Detector` and `Handler` respectively. During training, `Detector` will find for each sample the probability that it has been *correctly* labeled (i.e. `conf_score`). `Handler` can use that information in many ways, like removing likely noisy instances from dataset (`Filtering` class), or assigning more weight on reliable samples (`example_weighting` module) etc.\n",
    "\n",
    "Let's rewrite the above example. We'll use `KDN`: a simple neighborhood-based noise detector, and `WeightedBagging`: a variant of regular bagging that takes sample reliability into account."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.9181286549707602\n"
     ]
    }
   ],
   "source": [
    "from skclean.detectors import KDN\n",
    "from skclean.handlers import WeightedBagging\n",
    "\n",
    "conf_score = KDN(n_neighbors=5).detect(X_train, y_train_noisy)  \n",
    "clf = WeightedBagging(n_estimators=50).fit(X_train, y_train_noisy, conf_score)\n",
    "print(clf.score(X_test, y_test))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The above code is fine for very simple workflow. However, real world data modeling usually includes lots of sequential steps for preprocesing, feature selection etc. Moreover, hyper-paramter tuning, cross-validation further complicates the process, which, among other things, frequently leads to [Information Leakage](https://machinelearningmastery.com/data-leakage-machine-learning/). An elegant solution to this complexity management is `Pipeline`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Pipeline\n",
    "`scikit-clean` provides a customized `Pipeline` to manage modeling which involves lots of sequential steps, including noise detection and handling. Below is an example of `pipeline`. At the very first step, we introduce some label noise on training set. Some preprocessing like scaling and feature selection comes next. The last two steps are noise detection and handling respectively, these two must always be the last steps."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.9332712311752832\n"
     ]
    }
   ],
   "source": [
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.feature_selection import VarianceThreshold\n",
    "from sklearn.svm import SVC\n",
    "from sklearn.model_selection import ShuffleSplit, StratifiedKFold\n",
    "\n",
    "from skclean.handlers import Filter\n",
    "from skclean.pipeline import Pipeline         # Importing from skclean, not sklearn\n",
    "\n",
    "\n",
    "clf = Pipeline([\n",
    "        ('scale', StandardScaler()),          # Scale features\n",
    "        ('feat_sel', VarianceThreshold(.2)),  # Feature selection\n",
    "        ('detector', KDN()),                  # Detect mislabeled samples\n",
    "        ('handler', Filter(SVC())),           # Filter out likely mislabeled samples and then train a SVM\n",
    "])\n",
    "\n",
    "inner_cv = ShuffleSplit(n_splits=5,test_size=.2,random_state=1)\n",
    "outer_cv = StratifiedKFold(n_splits=5,shuffle=True,random_state=2)\n",
    "\n",
    "clf_g = GridSearchCV(clf,{'detector__n_neighbors':[2,5,10]},cv=inner_cv)\n",
    "\n",
    "n_clf_g = make_pipeline(UniformNoise(.3),clf_g)            # Create label noise at the very first step\n",
    "print(cross_val_score(n_clf_g, X, y, cv=outer_cv).mean())  # 5-fold cross validation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are two important things to note here. First, don't use the `Pipeline` of `scikit-learn`, import from `skclean.pipeline` instead. \n",
    "\n",
    "Secondly, a group of noise hanbdlers are iterative: they call the `detect` of noise detectors multiple times (`CLNI`, `IPF` etc). Since they don't exactly follow the sequential noise detection->handling pattern, you must pass the detector in the constructor of those `Handler`s."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "from skclean.handlers import CLNI\n",
    "\n",
    "clf = CLNI(classifier=SVC(), detector=KDN())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "All `Handler` *can* be instantiated this way, but this is a *must* for iterative ones. (Use `iterative` attribute to check.)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Noise Simulation\n",
    "\n",
    " "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Remember that as a library written primarily for researchers, you're expected to have access to \"true\" or \"clean\" labels, and then introduce noise to training data by flipping those true labels. `scikit-clean` provides several commonly used noise simulators- take a look at [this example](./Noise%20SImulators.ipynb) to understand their differences. Here we mainly focus on how to use them.\n",
    "\n",
    "Perhaps the most important thing to remember is that noise simulation should usually be the very first thing you do to your training data. In code below, `GridSearchCV` is creating a validation set *before* introducing noise and using clean labels for inner loop, leading to information leakage. This is probably NOT what you want."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.9244216736531594\n"
     ]
    }
   ],
   "source": [
    "clf = Pipeline([\n",
    "    ('simulate_noise', UniformNoise(.3)), # Create label noise at first step \n",
    "    ('scale', StandardScaler()),          # Scale features\n",
    "    ('feat_sel', VarianceThreshold(.2)),  # Feature selection\n",
    "    ('detector', KDN()),                  # Detect mislabeled samples\n",
    "    ('handler', Filter(SVC())),           # Filter out likely mislabeled samples and then train a SVM\n",
    "])\n",
    "clf_g = GridSearchCV(clf,{'detector__n_neighbors':[2,5,10]},cv=inner_cv)\n",
    "print(cross_val_score(clf_g, X, y, cv=outer_cv).mean())  # 5-fold cross validation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can use noise simulators outside `Pipeline`, all `NoiseSimulator` classes are simple wrapper around functions. `UniformNoise` is a wrapper of `flip_labels_uniform`, as the first example of this document shows."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Datasets & Performance Evaluation\n",
    "Unlike deep learning datasets which tends to be massive in size, tabular datasets are usually lot smaller. Any new algorithm is therefore compared using multiple datasets against baselines. The `skclean.utils` module provides two important functions to help researchers in these tasks:\n",
    "\n",
    "1. `load_data`: to load several small to medium-sized preprocessed datasets on memory.\n",
    "\n",
    "2. `compare`: These function takes several algorithms and datasets, and outputs the performances in a csv file. It supports automatic resumption of partially computed results, specially helpful for comparing long running, computationally expensive methods on big datasets.\n",
    "\n",
    "Take a look at [this notebook](./Evaluating%20Robust%20Methods.ipynb) to see how they are used."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
